Random drift versus selection in academic vocabulary: an 
evolutionary analysis of published keywords 
Abstract 

The evolution of vocabulary in academic publishing 
is characterized via keyword frequencies 
recorded in the ISI Web of Science citations 
database. In four distinct case-studies, evolu- 
tionary analysis of keyword frequency change 
through time is compared to a model of ran- 
dom copying used as the null hypothesis, such 
that selection may be identified against it. The 
case studies from the physical sciences indicate 
greater selection in keyword choice than in the 
social sciences. Similar evolutionary analyses 
can be applied to a wide range of phenomena, 
wherever the popularity of multiple items 
through time has been recorded, as with web 
searches, or sales of popular music and books, 
for example. 



1 Introduction 

Ideally, science is the systematic process of 
testing multiple hypotheses, but as practiced 
by real people, it is also distinctly social. 
Within complex collaboration networks, aca- 
demics compete for citations, particularly in 
our modern era of online citation databases 
that can 'summarize' an academic's career at 
a single command [23]. They are 
therefore prone to copy ideas, and particularly 
buzzwords, from one another [21, 26]. 

Diverse opinions exist as to what constitutes 
trendy ideas versus more meaningful research 



paradigms; the challenge is to evaluate this 
by some objective means. In other realms of 
fashion, ranked lists are increasingly a part 
of our world; from universities to Internet 
searches, downloads, book and music sales. 
Correspondingly, the design of algorithms 
needed to track 'what's hot and what's not' 
has itself become a hot topic in computer 
science. Indeed, as journals are now ranked 
by their impact factor - increasingly a subject 
of study [10, 16] - there is no reason why we 
cannot look at academic buzzwords the same 
way: rank them in order of popularity from 
year to year, and track the comings and goings 
of 'what's hot' on such lists. 

As the science of how attributes are passed 
on and modified through time [20], evolutionary 
theory is an ideal means to model these 
aspects of scientific process [11]. Previous 
work using evolutionary models has shown, 
counter-intuitively, that many patterns of 
change in cultural choices over time can be 
explained as random drift; i.e. the effect of 
chance on what happens to be copied, together 
with the occasional appearance of innovations 
[2T] ESI [S]. Meaningful selection, as opposed to 
random copying, occurs when such choices are 
made on the basis of something inherent to the 
choice itself [5] - as with a 'better mousetrap' 
for example, or something inherently preferable 
to human tastes. 

In knowledge production, ideas are not 
always adopted out of inherent superiority, but 



often merely because others are using those 
ideas. In either case, the transmission process 
is evolutionary; predominantly one of adopting 
what others have done, with creative modifi- 
cations contributing new ideas that eventually 
replace old ones through being adopted. 'Ideas' 
of course is a nebulous description, so this 
study focuses specifically on the evolution of 
keyword use in academic publishing. 

By analyzing keyword frequencies as 
recorded in a citations database, one can 
characterize their replication in terms of a 
continuum between (a) random copying of 
fashionable buzzwords at one extreme (akin 
to random genetic drift), and (b) independent 
selection of keywords, based on inherent qualities, 
at the other (falsifying the neutral model). 
The question is one of degree, with variation 
expected along this basic continuum. Using 
random copying as the null hypothesis, one 
can simply seek to identify selection against 
the null without characterizing it specifically; 
although clearly the first hypothesis is that 
words are selected for usefully describing 
something real and relevant to the topic. 

It may seem cynical to assume first that 
keywords are copied without much thought, 
but several studies suggest this [2,3,9,12] 
and even George Orwell thought as much 
in his famous 1946 essay, 'Politics and the 
English language.' As the null hypothesis, 
random copying does not mean that the words 
themselves are chosen randomly, but that they 
are copied randomly from others who have 
already used them. The assumption is that 
randomly-copied keywords are value-neutral, 
in that no keyword is inherently more valuable 
than any other - the likelihood of any being 
chosen is simply proportional to its current 
popularity. This is in essence the neutral 
model of population genetics [7J [5D] . 

In previous simulations, the random copy- 
ing, or neutral, model has been represented 
as follows: Start with a set of N individuals, 
which are replaced by N new individuals in 
each generation. Over successive generations, 
each of the N new individuals copies its 
variant from a randomly-selected individual 
in the previous generation, with the exception 
of a small fraction, μ (< 5%), of the N new 
individuals who invent a new variant in the 
current generation. 
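
The copying rule just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the previous simulations: the function name, the starting condition (every individual begins with a unique variant) and the choice to let each individual innovate independently with probability μ (so that the innovating fraction equals μ on average rather than exactly) are all assumptions made here.

```python
import random
from collections import Counter


def simulate_random_copying(n, mu, generations, seed=0):
    """Sketch of the random-copying (neutral) model.

    n           -- number of individuals replaced in each generation (N)
    mu          -- probability that a new individual invents a new variant
    generations -- number of generations to simulate
    Returns a list of Counters (variant id -> count), one per generation,
    including the starting generation.
    """
    rng = random.Random(seed)
    population = list(range(n))  # assumption: all variants start out unique
    next_variant = n             # label reserved for the next invention
    history = [Counter(population)]
    for _ in range(generations):
        new_population = []
        for _ in range(n):
            if rng.random() < mu:
                # Innovation: a variant never seen before.
                new_population.append(next_variant)
                next_variant += 1
            else:
                # Random copying: pick any individual of the previous generation.
                new_population.append(rng.choice(population))
        population = new_population
        history.append(Counter(population))
    return history


history = simulate_random_copying(n=500, mu=0.02, generations=200)
top5 = history[-1].most_common(5)  # the five most popular variants at the end
```

Because a variant is copied with probability proportional to its current count, no variant has any inherent advantage; any pattern in `history` arises from sampling and innovation alone.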

The neutral model is simple to simulate, 
yet has been shown to provide richly complex 
results that produce at least three useful 
predictions relevant to cultural drift [2TJ |52] : 

1. If individual variants are tracked through 
the generations, their frequencies (relative 
popularities) will change in a stochastic 
manner, as opposed to a directed man- 
ner or completely random manner. More 
specifically, the haploid neutral model pre- 
dicts that the only source of change in vari- 
ant frequencies over time is random sam- 
pling, such that (3): 



\[ V = \frac{v(1 - v)}{N} \tag{1} \]



where V is the variance in frequencies from 
one time step to the next, and v < 1 is the 
relative frequency of the variant as a fraction 
of N, the maximum possible number 
of variant copies per generation. For small 
v, v(1 - v) ≈ v, which after rearranging 
eq. (1) indicates that NV/v ≈ 1. 

2. Like many processes of proportional ad- 
vantage (under random copying the chance 
of being copied is proportional to current 
frequency), the variant frequencies exhibit 
a long-tailed distribution, which for small 
values of μ follows a power law form [STJ [5] . 
This is one of the less diagnostic predic- 
tions, as a variety of mechanisms can gen- 
erate power law and related distributions 
[19]. Nonetheless, the distribution is use- 
ful as a null expectation. Among the possi- 
ble departures from this null, selective bias 
for novelty (e.g., some maximum thresh- 
old of popularity) should truncate the tail 
(high end) of the variant frequency distribution 
[3, 14]. Alternatively, there might 
be a conformist bias resulting in a 'winner 
take all' distribution, whereby one word 
has a higher frequency than predicted by 
the power law for the rest of the words. 

3. There is continual turnover in the variant 
pool. If the variants are ranked in order 



of decreasing frequency, the turnover z in 
that list over successive generations (time) 
depends much more strongly on μ than on 
N [22], such that: 



\[ z \approx \sqrt{\mu} \tag{2} \]



where z is measured as the fraction of 
turnover in the list (e.g., two items re- 
placed in a Top 10 list would be 20% 
turnover). In contrast to random copy- 
ing, under selection the population size 
N should correlate positively with the 
turnover rate in the ranked list of most 
popular variants [22]. 
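
The first prediction can be turned into a simple diagnostic: for a word that drifts neutrally at low frequency, the ratio NV/v should be close to 1. The sketch below is one possible way to estimate that ratio from yearly counts. It assumes that V is estimated as the mean squared change in relative frequency between consecutive years, and that v and N are taken as their averages over the period; the text does not spell out the exact estimator, so these choices and the function name are assumptions.

```python
def variability_ratio(counts, totals):
    """Estimate N*V/v for one word.

    counts -- occurrences of the word in each successive year
    totals -- total number of keywords (N) in each of the same years
    """
    freqs = [c / n for c, n in zip(counts, totals)]
    steps = [later - earlier for earlier, later in zip(freqs, freqs[1:])]
    # Assumption: V is the mean squared one-step change in frequency.
    variance = sum(d * d for d in steps) / len(steps)
    mean_freq = sum(freqs) / len(freqs)   # v
    mean_total = sum(totals) / len(totals)  # N
    return mean_total * variance / mean_freq


# Hypothetical counts for one word over five years, with N = 1000 each year.
ratio = variability_ratio([50, 55, 48, 52, 50], [1000] * 5)
```

Values well above 1 would point to more change than sampling alone predicts (for example directional change under selection), while values well below 1 would point to a word held unusually steady.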

Using these three predictions as the null model, 
it is easier to identify selection, which is effectively 
demonstrated by departures from these 
patterns, dependent on the kind of selection op- 
erating. 

In applying this to keyword use, let N repre- 
sent the number of keywords in a given time pe- 
riod (rather than the number of articles, which 
vary in their number of keywords). This ensures 
that each individual corresponds with exactly 
one variant. The invention rate μ is then 
the fraction of those words in each time interval 
that are appearing for the first time. 
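
As a minimal sketch of this bookkeeping, the function below takes the keywords of each year and returns N, the vocabulary size, the number of new keywords Nμ and the invention rate μ. The function name and the input layout are hypothetical, and it assumes that each previously unseen word counts once in Nμ, however often it occurs in its first year; the text does not state which convention was used.

```python
def invention_stats(keywords_by_year):
    """keywords_by_year: dict mapping year -> list of keywords (with repeats)."""
    seen = set()  # every word observed in earlier years
    stats = {}
    for year in sorted(keywords_by_year):
        words = keywords_by_year[year]
        vocabulary = set(words)
        new_words = vocabulary - seen  # words appearing for the first time
        n = len(words)
        stats[year] = {
            "N": n,
            "vocab": len(vocabulary),
            "N_mu": len(new_words),
            "mu": len(new_words) / n if n else 0.0,
        }
        seen |= vocabulary
    return stats


# Hypothetical two-year record.
example = {
    2002: ["network", "networks", "complex", "network"],
    2003: ["network", "scale-free", "complex", "growth", "network"],
}
stats = invention_stats(example)
```

In the first year of any record every word is new by construction, so μ is only informative once some history has accumulated.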

2 Data 

The data used in this analysis were taken from 
Thomson Scientific's 'Web of Science' (WoS) 
database, which covers articles from thousands of 
journals in science and engineering, social sci- 
ences, arts and humanities. Among the wealth 
of information provided, each journal article 
description in the WoS database contains the 
title, keywords and abstract, references cited, 
and a list of all papers in other journals that 
have cited the paper to date. 

As listed on the WoS database, the four case 
studies presented here provide a test of dif- 
ferences of keyword use among published ar- 
ticles within older paradigms versus younger 
ones, and within the physical sciences versus 
the social sciences. In order to define these case 
studies, we need a working definition of a sub- 
field of academic publishing. If belabored, this 



could be quite a difficult task - many definitions 
would be too subjective, variable or broad. 

A way forward is to define a scientific 
'paradigm' [13] as comprising the scientific papers 
that were in some way inspired by a certain 
highly-influential paper. We thus can define 
each academic paradigm as the set of all 
papers that cited a certain highly-cited paper. 
The citing papers may occur in a range of different 
journals, but they will all share the defining 
characteristic of citing the highly-influential 
work. 

Consider four highly-cited, seminal works, 
two from the natural sciences and two from 
the social sciences. To see the effect of time, 
from the pair in each category we include one 
work about 30 years old and the other about 
ten years old. This provides two comparisons: 
older versus younger fields of study, and social 
sciences versus physical sciences. 

From the physical sciences we have a paper 
by Barabasi and Albert (PS99, for 'physical sciences, 
1999') in 1999 [1], which introduced a 
quantitative model of 'scale-free networks' and 
has been cited over 2,000 times (as listed on the 
WoS database), and one by Witten and Sander 
(PS81) from 1981 [30], which introduced the 
physics model of 'diffusion limited aggregation', 
and has been cited over 1,300 times. From the 
social sciences, there is a paper by Nahapiet 
and Ghoshal (SS98) in 1998 [18], cited over 460 
times, which reviewed the influential concept 
of 'social capital', and a 1977 book by Bourdieu 
(SS77), cited over 2,700 times, which introduced 
such influential concepts as 'agency' 
and 'structuration' into the social sciences [I]. 

For each of the sets of articles within each 
defined paradigm, the keywords data from the 
WoS database were taken only from titles and 
keywords chosen by the authors (not the WoS 
'Keywords plus' which is an automated condensation 
of the cited references), and then sorted 
by publication year. The following common 
words were removed from the data: a, an, and, 
as, by, for, from, in, its, of, on, the, to, using, 
and with. Aside from these, no other common 
words were present in high enough frequencies 
to significantly affect the patterns discussed be- 
low. 
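
The cleaning step can be sketched as follows. The stop-word list is the one given above; everything else is an assumption made for illustration, including the function name, the lower-casing, the rule that hyphenated terms stay together as one word, and the decision to keep a word twice when it occurs in both the title and the author keywords.

```python
import re

# The common words listed in the text.
STOP_WORDS = {
    "a", "an", "and", "as", "by", "for", "from", "in",
    "its", "of", "on", "the", "to", "using", "with",
}


def extract_keywords(title, author_keywords):
    """Return the single-word keywords of one article, stop words removed."""
    text = " ".join([title] + list(author_keywords)).lower()
    # Assumption: a word is a run of letters or digits, optionally hyphenated.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text)
    return [token for token in tokens if token not in STOP_WORDS]


words = extract_keywords(
    "Emergence of scaling in random networks",
    ["scale-free networks"],
)
```

Pooling the output of such a function over all articles of a given publication year gives the yearly keyword list from which N and the word frequencies are counted.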



3 Results 

Figure 1 shows the temporal change in N, the 
number of keywords for each case study per 
year, and in N/j,, the number of new keywords 
per year, for paradigms about 10 years old (Fig- 
ure la) and 30 years old (Figure lb). A new 
keyword was one which had not appeared in 
the record beforehand, with records starting in 
1994 for the older works and date of publica- 
tion (1998, 1999) for the younger paradigms. 

Table 1 shows additional statistics for each 
paradigm averaged from 2002 to 2006, the sam- 
ple period applicable to all four case studies 
(the newer case studies do not have enough data 
before 2002). In each case, the quantities N 
and Nμ parallel each other (Figure 1), indicating 
a consistent and substantial invention rate 
μ between 15 and 30% in all cases (Table 1). 
Within the older pair and the younger pair of 
paradigms, the invention rate μ was higher for 
the social science than for the physical science 
case (Table 1). This is true even though the 
comparison differs in the number of words: N 
is larger for PS99 than SS98, but lower for PS81 
than SS77. 

Table 1: Average values, from 2002-2006, 
of the number of keywords N, newly appearing 
keywords Nμ, and different keywords or 'vocabulary'. 
The invention fraction μ is shown as 
a range, representing the decline in this value 
over the time period. 

         SS77    PS81    PS99    SS98 
N        1671    1050    2660     885 
Vocab    1036     566     979     431 
Nμ        441     192     511     224 
μ (%)   45-18   28-16   48-13   52-14 
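
As a quick arithmetic check, the average invention rate for each paradigm can be recovered from the N and Nμ rows of Table 1. The dictionary layout below is only a convenience for this sketch.

```python
# (N, N_mu) averages for 2002-2006, copied from Table 1.
table1 = {
    "SS77": (1671, 441),
    "PS81": (1050, 192),
    "PS99": (2660, 511),
    "SS98": (885, 224),
}

# Average invention rate: new keywords as a fraction of all keywords.
mu = {name: n_mu / n for name, (n, n_mu) in table1.items()}

for name, value in mu.items():
    print(f"{name}: mu = {value:.1%}")
```

All four ratios fall between 15 and 30%, and within each age pair the social science paradigm has the higher value, in line with the text.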



Figure 1: Keywords, total and new, among 
paradigms about (a) 10 and (b) 30 years 
old. Social science cases in red and physical 
sciences in black. Solid curves show the total 
number of keywords N per year, and the 
dashed curves show the number of new keywords 
Nμ introduced per year. Logarithmic y-axis. 

In addition to a higher innovation fraction 
for the social science paradigms, there is also a 
marked difference in the turnover in keywords. 
Consider the top 5 keywords, in terms of popularity, 
over the years in each case study (below 
the top 5, keywords start to become insufficient 
in their numbers of appearances). As 
the best way to view overall trends in turnover, 
Figure 2 shows the cumulative turnover in the 
top 5 keywords, expressed as a fraction (e.g., 4 
words having passed through the top 5 = 80% 
turnover). In the physical science paradigms, 
the turnover in the top 5 keywords leveled off 
to virtually no turnover in the last several years. 
At the other end of the spectrum, the key- 
words in the social science cases show a high 
and steady turnover throughout the sampling 
period (Figure 2). In the case of SS77, this 
turnover persisted long after its publication, 
and many years beyond which PS81 had lev- 
eled off. 
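
The cumulative turnover measure can be sketched as below: each word that enters the top list for the first time adds one to a running count, which is then divided by the list length. The function name and input layout are hypothetical, the first year is treated as the baseline, and ties in frequency are broken by order of first appearance (the behaviour of `Counter.most_common`), which is an assumption rather than a rule stated in the text.

```python
from collections import Counter


def cumulative_turnover(keywords_by_year, top=5):
    """Cumulative fraction of first-time entries into the top list, per year."""
    seen_in_top = None
    entered = 0
    result = {}
    for year in sorted(keywords_by_year):
        counts = Counter(keywords_by_year[year])
        ranked = [word for word, _ in counts.most_common(top)]
        if seen_in_top is None:
            seen_in_top = set(ranked)  # baseline list, no turnover yet
        else:
            newcomers = [word for word in ranked if word not in seen_in_top]
            entered += len(newcomers)
            seen_in_top.update(newcomers)
        result[year] = entered / top
    return result


# Hypothetical record, using a top-2 list to keep the example small.
example = {
    2001: ["a", "a", "b", "b", "c"],
    2002: ["a", "a", "c", "c", "b"],
    2003: ["c", "c", "c", "b", "b", "a"],
    2004: ["d", "d", "d", "c", "c", "a"],
}
turnover = cumulative_turnover(example, top=2)
```

A curve that keeps climbing corresponds to the steady turnover seen in the social science cases, while a curve that flattens corresponds to the levelling off seen in the physical science cases.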

Whereas the continual turnover in SS77 and 
SS98 is consistent with random copying with in- 
novation, the cessation of turnover in PS81 and 
especially PS99 suggests selection. As Figure 
3a shows, the selective sorting of the keyword 
frequencies for PS99 was strong enough that 
even the keyword networks (highlighted in red) 
occupies a distinct frequency ranking from the 
singular network (blue), while other entries are 
similarly locked into their positions among the 
top 5. Although this pattern of selection is not 
as strong in the older physical science paradigm 
(PS81), the blue versus black lines in Figure 3c 
show apparent groupings of words by selected 
frequencies. In contrast, both the older and 
younger social science cases (SS77 and SS98) 
appear more stochastic in their histories of individual 
word frequencies (Figure 3b, d), and 
with each at a relatively low frequency compared 
to the network science case (Figure 3a, c). 
In the SS77 case, the ratio NV/v increases 
moving down the rankings (Table 2), 
which suggests a possible conformist bias, in 
that the more frequent words have been preferentially 
selected (e.g. red curve in Figure 3d). 

As described above, the ratio NV/v can be 
used to characterize keyword variability, allowing 
comparison across case studies for the period 
2002-2006 (Table 2). Averaged over the 
five keywords, NV/v differs more by age of the 
paradigm than by subject matter, being higher 
for the younger (2.3) than the older (1.3-1.4) 
paradigms. Within each age pair, however, the 
physical sciences paradigm has the larger standard 
error in the mean value of NV/v (Table 2). 
This reflects certain keywords in the 
physical science paradigms whose popularity 
changed directionally, apparently due to selection. 
In the PS99 case, the word complex (word 
3, NV/v = 5.8) appears to have been selected 
for, as it doubled in frequency from 2002 to 
2006 beyond what would be expected from random 
drift. Also in the PS99 case, networks 
(word 1) declined steadily as network (word 2) 
increased, such that their variability scores are 
near 2. By contrast, the words in the SS98 case 
do not show such directionality in their change 
(Figure 3b), and the high variability scores for 
four of the five words (Table 2) are due to their 
fluctuating frequencies over the time interval 
(Figure 4b). Curiously, in the PS81 case, the 
word aggregation (word 2, NV/v = 0.3) was 
considerably less variable than diffusion (word 
5, NV/v = 2.7) even though the seminal paper 
[30] was about diffusion-limited aggregation. 

Figure 2: Cumulative turnover in the top 
5 keywords. Social science cases shown in red 
and physical sciences in black. Turnover refers 
to words making a first appearance in the top 
5. For the older paradigms (SS77; PS81), symbols 
are squares and the count begins at 1994; 
for the newer articles (PS99; SS98) symbols are 
circles and the count begins the year after publication. 

Table 2: Values of NV/v for the top 5 
words, 2002-2006, tracked in Figure 3. Numbers 
in parentheses give standard error on the 
trailing digits. 

       SS77       PS81       PS99       SS98 
Wd1    0.82       1.52       2.10       2.73 
Wd2    1.13       0.34       1.82       2.75 
Wd3    1.10       0.71       5.85       0.97 
Wd4    1.46       1.55       1.28       2.66 
Wd5    1.75       2.69       0.59       2.19 
Ave    1.25(16)   1.36(41)   2.33(92)   2.26(34) 
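
The averages in the last row of Table 2 can be checked directly from the five values in each column. The sketch below assumes that the standard error is the sample standard deviation divided by the square root of the number of words; under that assumption the computed values agree with the tabulated ones to two decimals.

```python
from math import sqrt
from statistics import mean, stdev

# NV/v for words 1-5 in each case study, copied from Table 2.
table2 = {
    "SS77": [0.82, 1.13, 1.10, 1.46, 1.75],
    "PS81": [1.52, 0.34, 0.71, 1.55, 2.69],
    "PS99": [2.10, 1.82, 5.85, 1.28, 0.59],
    "SS98": [2.73, 2.75, 0.97, 2.66, 2.19],
}

# (mean, standard error of the mean) for each case study.
summary = {
    name: (mean(values), stdev(values) / sqrt(len(values)))
    for name, values in table2.items()
}

for name, (average, error) in summary.items():
    print(f"{name}: {average:.2f} +/- {error:.2f}")
```

The large standard error for PS99 is driven almost entirely by the single value of 5.85 for word 3.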



Finally, consider keyword frequency distribu- 
tions for two time-slices, years 2001 and 2005 
(Figure 4). All show essentially a power law 
form, which could be consistent with the 
neutral model but also with a variety of models of 
proportionate advantage [19]. More revealing 
is the degree of change in the power law exponent 
(slope on the log-log plot) over this 4-year 
time span. In three cases, the slope is nearly 
the same for 2005 as for 2001, but for PS99, the 
slope is considerably less for 2005. The decreasing 
slope for PS99 correlates with a decreasing 
invention rate μ over this time span (Table 1), 
and reflects the diminishing probability for any 
new keyword to ever reach the top 5. 

Figure 3: Frequencies of the top 5 keywords 
of 2005. Shown are the four paradigm 
case studies, including: (a) newer physical sci- 
ences (PS99); (b) newer social sciences (SS98); 
(c) older physical sciences (PS81); and (d) older 
social sciences (SS77). Logarithmic y-axes. 



Figure 4: Cumulative frequency distribu- 
tions of all keywords. Open circles show dis- 
tribution for 2001 and filled circles are for 2005. 
The paradigms are (a) newer physical sciences 
(PS99), (b) newer social sciences (SS98), (c) 
older physical sciences (PS81) and (d) older social 
sciences (SS77). Using the least-squares 
method [19], the estimated power-law exponents 
for 2001 and 2005, respectively, are as follows: 
PS99: 2.11, 2.00; SS98: 2.11, 2.02; PS81: 
2.09, 2.05; SS77: 2.18, 2.09. Errors (by jack- 
knife estimate) on these exponents are < 0.01. 
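
A least-squares fit on a log-log plot can be sketched as follows. This is not the fitting code behind the exponents quoted in the caption: the function names are hypothetical, the fit uses every point of the cumulative distribution without any lower cutoff, and the sketch returns only the fitted slope, since the text does not say how the quoted exponents relate to the slope of the cumulative curve.

```python
import math
from collections import Counter


def cumulative_distribution(word_counts):
    """Return (k, fraction of words occurring at least k times), ascending in k."""
    counts = sorted(word_counts.values())
    total = len(counts)
    points = []
    for index, k in enumerate(counts):
        if index == 0 or k != counts[index - 1]:
            points.append((k, (total - index) / total))
    return points


def loglog_slope(points):
    """Ordinary least-squares slope of log10(y) against log10(x)."""
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical keyword list for one year.
points = cumulative_distribution(Counter(["a", "a", "b", "c", "c", "c", "d"]))

# Points lying exactly on y = 1/x should give a slope of -1.
check = loglog_slope([(k, 1.0 / k) for k in (1, 2, 4, 8, 16)])
```

Comparing the slope fitted for two years, as done for 2001 and 2005, then shows whether the shape of the distribution has shifted over time.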



The frequency distributions in Figure 4 enable 
the identification of copying biases. Although 
all four paradigms yield essentially 
power law distributions, two cases - PS81 
and SS98 - show marked departures from a 
power law in the truncations of the tail (Figure 
4b and 4c). In each case there appears to 
be selection for the top 3 or 4 words (and they 
are the same words in 2001 and 2005 for each 
case), such that their frequencies are roughly 
the same rather than following the power law. 



4 Discussion & conclusions 

By treating academic keywords as discrete ele- 
ments of evolution, this study finds that differ- 
ent academic niches - as defined by sets of pub- 
lications which share a single seminal article in 
their cited reference lists - can show markedly 
different evolutionary patterns. From the case 
studies considered, it appears that some aca- 
demic fields are characterized by a high de- 
gree of drift, resulting in continual and unpre- 
dictable change in vocabulary, whereas in oth- 
ers words appear under selection, such that the 
predominant vocabulary becomes increasingly 
crystallized and unchanging over time. 

Among the cases presented, the social science 
paradigms showed the stronger patterns of ran- 
dom copying with invention, including constant 
turnover in the keywords of highest frequency, 
and the stochastic ups-and-downs of individual 
word frequencies over time. In contrast, the 
physical science paradigms showed a rejection 
of the neutral model, particularly in the cessa- 
tion of turnover in the top keywords over time. 

The scale of analysis is a key variable; a 
certain group of keywords might be selected, 
yet drifting within the group. Similarly, in a 
different study, while choices of baby names 
for the entire United States are indistinguishable 
from random copying [8], different ethnic 
groups certainly select from different pools of 
names 5 , and it remains to be studied whether 
random drift would predominate again within 
these groups. 

In addition to these particular points, this 
study is meant to demonstrate how a simi- 



lar evolutionary analysis could be performed 
on any cultural dataset comprising discrete el- 
ements. This evolutionary analysis contrasts 
with the increasing representation of knowledge 
growth as networks [SSJ HH1 H2] with the indi- 
viduals (e.g. authors) as 'nodes', and their in- 
teractions (e.g. cited references) as 'links'. A 
particular challenge for network analysis, how- 
ever, is change, because a network implies a 
structure to interactions - the connections of 
today determine what will happen tomorrow, 
such that change must be modeled as a mod- 
ification of the existing network. However, in 
fashionable realms, yesterday might be less im- 
portant than tomorrow, and interactions of in- 
fluence may differ completely from one day to 
the next. Change can be the essence of the pro- 
cess, rather than just a modification. 

For this reason, evolutionary theory can of- 
ten naturally account for change that may 
be seen as exceptional in a network model 
[11, 12, 20]. A recent network analysis |BJ, 
for example, tracked coauthorships and mo- 
bile phone calls to show that, in order to have 
longevity, small groups require stability in their 
composition, whereas large groups last a bit 
longer with some degree of turnover in their 
membership. This is, in fact, a basic prediction 
of the genetic drift model: small populations 
are destroyed by drift, large populations can 
tolerate it and even find it adaptive. The cru- 
cial difference is that in the network analysis 
[5] mutation was measured as absolute num- 
ber of changes, whereas the random copying 
model defines mutation μ as a fraction of N. 
Hence the random copying model would have 
predicted the network result, in that coherence 
disintegrates more quickly with one mutation 
per time step in a population of 4 versus a pop- 
ulation of 100, for example, because the former 
is a much higher mutation rate. 
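
The arithmetic behind this comparison, with one change per time step in each population, is:

```latex
\[
\mu_{N=4} = \frac{1}{4} = 0.25,
\qquad
\mu_{N=100} = \frac{1}{100} = 0.01,
\qquad
\frac{\mu_{N=4}}{\mu_{N=100}} = 25 .
\]
```

The same absolute number of changes therefore amounts to a mutation rate 25 times higher in the small group.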

Change, in fact, is central to evolutionary 
theory. Some basic evolutionary 
analyses, with parallels in population genetics, 
can be used to characterize different forms 
of innovation and transmission of discrete cul- 
tural elements. Identifying what proceeds in 
predictable directions, as opposed to drifting 
upon the tides of fashion, would be of great 



utility in understanding the evolution of knowl- 
edge. It is wasted effort to try to predict the 
future of randomly drifting fashionable buz- 
zwords [U [TS] , but one might hope to predict 
selected elements, such as valid new scientific 
terms. The kind of evolutionary analysis used 
here is generally applicable to any case study 
where popularity can be presented in the form 
of frequencies and ranked lists over time. 